Active Learning for Multilingual Statistical Machine Translation

Authors

  • Gholamreza Haffari
  • Anoop Sarkar
Abstract

Statistical machine translation (SMT) models require bilingual corpora for training, and these corpora are often multilingual, with parallel text in multiple languages simultaneously. We introduce an active learning (AL) task of adding a new language to an existing multilingual set of parallel text and constructing high-quality MT systems from each language in the collection into this new target language. We show that adding a new language to the EuroParl corpus using active learning provides a significant improvement over a random sentence selection baseline. We also provide new, highly effective sentence selection methods that improve AL for phrase-based SMT in both the multilingual and the single-language-pair settings.
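The abstract contrasts active sentence selection with a random baseline. The sketch below illustrates the general shape of a pool-based selection loop. It is not the authors' method: the bigram-novelty score, the `translate` callback (standing in for a human translator), and all function names are assumptions made purely for illustration.

```python
from typing import Callable, List, Set, Tuple

Bigram = Tuple[str, ...]


def ngrams(sentence: str, n: int = 2) -> Set[Bigram]:
    """Return the set of word n-grams in a whitespace-tokenized sentence."""
    tokens = sentence.split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def novelty_score(sentence: str, seen: Set[Bigram]) -> float:
    """Hypothetical heuristic: fraction of the sentence's bigrams not yet covered."""
    grams = ngrams(sentence)
    if not grams:
        return 0.0
    return len(grams - seen) / len(grams)


def select_batch(pool: List[str], seen: Set[Bigram], batch_size: int) -> List[str]:
    """Pick the highest-scoring sentences; ties keep their original pool order."""
    ranked = sorted(pool, key=lambda s: novelty_score(s, seen), reverse=True)
    return ranked[:batch_size]


def active_learning_loop(
    pool: List[str],
    translate: Callable[[str], str],  # assumed oracle, e.g. a human translator
    rounds: int,
    batch_size: int,
) -> List[Tuple[str, str]]:
    """Repeatedly select sentences, obtain translations, and grow the labeled set."""
    pool = list(pool)  # work on a copy so the caller's list is untouched
    labeled: List[Tuple[str, str]] = []
    seen: Set[Bigram] = set()
    for _ in range(rounds):
        if not pool:
            break
        for sentence in select_batch(pool, seen, batch_size):
            labeled.append((sentence, translate(sentence)))
            seen |= ngrams(sentence)
            pool.remove(sentence)
        # A real system would retrain the SMT model on `labeled` here.
    return labeled
```

A random baseline would replace `select_batch` with a call such as `random.sample(pool, batch_size)`; comparing translation quality under the two selection strategies at equal labeling budgets is the kind of experiment the abstract describes.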


Similar resources

Multilingual Word Embeddings using Multigraphs

We present a family of neural-network-inspired models for computing continuous word representations, specifically designed to exploit both monolingual and multilingual text. This framework allows us to perform unsupervised training of embeddings that exhibit higher accuracy on syntactic and semantic compositionality, as well as multilingual semantic similarity, compared to previous models trai...


Building Strong Multilingual Aligned Corpora

Recent advances have allowed algorithms that learn from aligned natural language texts to exploit aligned sentences in more than two languages. We investigate ways of combining (N choose 2) bilingual aligned corpora together to create a multilingual aligned corpus across N languages. As a result of the combination of several corpora, our algorithms output a multilingual corpus, with each aligned tup...


Machine Learning Approaches for Dealing with Limited Bilingual Data in Statistical Machine Translation

Statistical machine translation (SMT) systems have made great strides in translation quality. However, high quality translation output is dependent on the availability of massive amounts of parallel text in the source and target language. There are a large number of languages that are considered “low-density”, either because the population speaking the language is not very large, or even if mil...


A Corpus and Semantic Parser for Multilingual Natural Language Querying of OpenStreetMap

We present a corpus of 2,380 natural language queries paired with machine-readable formulae that can be executed against the worldwide geographic data of the OpenStreetMap (OSM) database. We use the corpus to learn an accurate semantic parser that forms the basis of a natural language interface to OSM. Furthermore, we use response-based learning on parser feedback to adapt a statistical machine t...


Harvesting Parallel Text in Multiple Languages with Limited Supervision

The Web is an ever-increasing, dynamically changing, multilingual repository of text. There have been several approaches to harvesting this repository for bootstrapping, supplementing, and adapting the data needed for training models in speech and language applications. In this paper, we present semi-supervised and unsupervised approaches to harvesting multilingual text that rely on a key observation o...



Publication date: 2009